AdEase Case Study¶
Gautam Naik (gautamnaik1994@gmail.com)
Github Link: https://github.com/gautamnaik1994/AdEaseTimeSeriesForecastingCaseStudy
About AdEase:
AdEase is a marketing and advertising company that helps businesses maximize clicks at minimal cost. It serves as an advertising infrastructure, enabling businesses to promote themselves effectively, economically, and with ease. Three AI modules (Design, Dispense, and Decipher) combine into an end-to-end, three-step digital advertising solution accessible to all.
Business Problem
The Data Science team at AdEase wants to understand per-page view counts for various Wikipedia pages over a span of 550 days, with the objective of forecasting views to optimize ad placement for clients. The team has access to data for 145,063 Wikipedia pages, including daily view counts for each page. Since AdEase's clients operate in diverse regions, they need insight into how ads will perform on pages in different languages.
Using the forecasts, we can identify the best time and language for placing ads to maximize the number of views.
Metrics
The following metric will be used to evaluate model performance:
- Mean Absolute Percentage Error (MAPE)
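MAPE averages the absolute error as a fraction of the actual value. A minimal sketch (equivalent in spirit to sklearn's `mean_absolute_percentage_error`, which is imported later; the function name here is ours):

```python
def mape_score(actual, forecast):
    """Mean Absolute Percentage Error: mean of |actual - forecast| / |actual|."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Per-point errors of 10%, 10%, and 0% average to ~6.67%.
print(mape_score([100, 200, 400], [110, 180, 400]))  # ~0.0667
```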
Data
The data is provided as a CSV file in wide format, with the following columns:
- Page: the name of the Wikipedia page (the title also encodes the language, access type, and agent)
- One column per date (2015-07-01 through 2016-12-31): the number of views the page received on that date
Forecasting Model
The following models will be used to forecast the number of views:
- ARIMA
- SARIMAX
- Prophet
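To compare these models on MAPE, a common setup is to hold out the final days of each series as a test window (a sketch with a hypothetical series; the exact split used later may differ):

```python
# Toy daily series; in this notebook each page has 550 days of data.
views = list(range(100, 150))  # 50 days of hypothetical view counts

horizon = 7  # forecast the final week
train, test = views[:-horizon], views[-horizon:]

print(len(train), len(test))  # 43 7
```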
import random
from datetime import timedelta
import warnings
import numpy as np
import pandas as pd
import polars as pl
import duckdb as db
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tools.sm_exceptions import ConvergenceWarning
from sklearn.metrics import (
mean_squared_error as mse,
mean_absolute_error as mae,
mean_absolute_percentage_error as mape
)
import optuna
from prophet import Prophet
pl.Config(fmt_str_lengths=100)
sns.set_theme(style="whitegrid")
warnings.simplefilter("ignore", UserWarning)
warnings.filterwarnings("ignore", category=UserWarning, module="statsmodels.tsa.base.tsa_model")
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="statsmodels.base.model")
columns = ['Spanish', 'Japanese', 'English', 'Chinese', 'German', 'Russian', 'French']
Data Loading¶
df = pl.read_csv("../data/raw/train_1.csv", has_header=True, ignore_errors=True)
lang_df = pl.read_csv("../data/processed/langcodes.csv", has_header=True, ignore_errors=True)
df
| Page | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | 2015-07-10 | 2015-07-11 | 2015-07-12 | 2015-07-13 | 2015-07-14 | 2015-07-15 | 2015-07-16 | 2015-07-17 | 2015-07-18 | 2015-07-19 | 2015-07-20 | 2015-07-21 | 2015-07-22 | 2015-07-23 | 2015-07-24 | 2015-07-25 | 2015-07-26 | 2015-07-27 | 2015-07-28 | 2015-07-29 | 2015-07-30 | 2015-07-31 | 2015-08-01 | 2015-08-02 | 2015-08-03 | 2015-08-04 | 2015-08-05 | … | 2016-11-25 | 2016-11-26 | 2016-11-27 | 2016-11-28 | 2016-11-29 | 2016-11-30 | 2016-12-01 | 2016-12-02 | 2016-12-03 | 2016-12-04 | 2016-12-05 | 2016-12-06 | 2016-12-07 | 2016-12-08 | 2016-12-09 | 2016-12-10 | 2016-12-11 | 2016-12-12 | 2016-12-13 | 2016-12-14 | 2016-12-15 | 2016-12-16 | 2016-12-17 | 2016-12-18 | 2016-12-19 | 2016-12-20 | 2016-12-21 | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | … | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| "2NE1_zh.wikipedia.org_all-access_spider" | 18 | 11 | 5 | 13 | 14 | 9 | 9 | 22 | 26 | 24 | 19 | 10 | 14 | 15 | 8 | 16 | 8 | 8 | 16 | 7 | 11 | 10 | 20 | 18 | 15 | 14 | 49 | 10 | 16 | 18 | 8 | 5 | 9 | 7 | 13 | 9 | … | 204 | 14 | 45 | 33 | 28 | 18 | 14 | 47 | 15 | 14 | 18 | 20 | 14 | 16 | 14 | 20 | 60 | 22 | 15 | 17 | 19 | 18 | 21 | 21 | 47 | 65 | 17 | 32 | 63 | 15 | 26 | 14 | 20 | 22 | 19 | 18 | 20 |
| "2PM_zh.wikipedia.org_all-access_spider" | 11 | 14 | 15 | 18 | 11 | 13 | 22 | 11 | 10 | 4 | 41 | 65 | 57 | 38 | 20 | 62 | 44 | 15 | 10 | 47 | 24 | 17 | 22 | 9 | 39 | 13 | 11 | 12 | 21 | 19 | 9 | 15 | 33 | 8 | 8 | 7 | … | 11 | 20 | 13 | 19 | 621 | 57 | 17 | 23 | 19 | 21 | 47 | 28 | 22 | 22 | 65 | 27 | 17 | 17 | 13 | 9 | 18 | 22 | 17 | 15 | 22 | 23 | 19 | 17 | 42 | 28 | 15 | 9 | 30 | 52 | 45 | 26 | 20 |
| "3C_zh.wikipedia.org_all-access_spider" | 1 | 0 | 1 | 1 | 0 | 4 | 0 | 3 | 4 | 4 | 1 | 1 | 1 | 6 | 8 | 6 | 4 | 5 | 1 | 2 | 3 | 8 | 8 | 6 | 6 | 2 | 2 | 3 | 2 | 4 | 3 | 3 | 5 | 3 | 5 | 4 | … | 4 | 0 | 1 | 4 | 5 | 8 | 8 | 1 | 1 | 2 | 5 | 3 | 3 | 3 | 7 | 3 | 9 | 8 | 3 | 210 | 5 | 4 | 6 | 2 | 2 | 4 | 3 | 3 | 1 | 1 | 7 | 4 | 4 | 6 | 3 | 4 | 17 |
| "4minute_zh.wikipedia.org_all-access_spider" | 35 | 13 | 10 | 94 | 4 | 26 | 14 | 9 | 11 | 16 | 16 | 11 | 23 | 145 | 14 | 17 | 85 | 4 | 30 | 22 | 9 | 10 | 11 | 7 | 7 | 11 | 9 | 11 | 44 | 8 | 14 | 19 | 10 | 17 | 17 | 10 | … | 10 | 14 | 17 | 11 | 9 | 11 | 5 | 10 | 8 | 17 | 13 | 23 | 40 | 16 | 17 | 41 | 17 | 8 | 9 | 18 | 12 | 12 | 18 | 13 | 18 | 23 | 10 | 32 | 10 | 26 | 27 | 16 | 11 | 17 | 19 | 10 | 11 |
| "52_Hz_I_Love_You_zh.wikipedia.org_all-access_spider" | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | 14 | 16 | 9 | 178 | 64 | 12 | 10 | 11 | 6 | 8 | 7 | 9 | 8 | 5 | 11 | 8 | 4 | 15 | 5 | 8 | 8 | 6 | 7 | 15 | 4 | 11 | 7 | 48 | 9 | 25 | 13 | 3 | 11 | 27 | 13 | 36 | 10 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "Underworld_(serie_de_películas)_es.wikipedia.org_all-access_spider" | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | 13 | 12 | 13 | 3 | 5 | 10 |
| "Resident_Evil:_Capítulo_Final_es.wikipedia.org_all-access_spider" | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
| "Enamorándome_de_Ramón_es.wikipedia.org_all-access_spider" | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
| "Hasta_el_último_hombre_es.wikipedia.org_all-access_spider" | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
| "Francisco_el_matemático_(serie_de_televisión_de_2017)_es.wikipedia.org_all-access_spider" | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | … | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null |
df.describe()
| statistic | Page | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | 2015-07-10 | 2015-07-11 | 2015-07-12 | 2015-07-13 | 2015-07-14 | 2015-07-15 | 2015-07-16 | 2015-07-17 | 2015-07-18 | 2015-07-19 | 2015-07-20 | 2015-07-21 | 2015-07-22 | 2015-07-23 | 2015-07-24 | 2015-07-25 | 2015-07-26 | 2015-07-27 | 2015-07-28 | 2015-07-29 | 2015-07-30 | 2015-07-31 | 2015-08-01 | 2015-08-02 | 2015-08-03 | 2015-08-04 | … | 2016-11-25 | 2016-11-26 | 2016-11-27 | 2016-11-28 | 2016-11-29 | 2016-11-30 | 2016-12-01 | 2016-12-02 | 2016-12-03 | 2016-12-04 | 2016-12-05 | 2016-12-06 | 2016-12-07 | 2016-12-08 | 2016-12-09 | 2016-12-10 | 2016-12-11 | 2016-12-12 | 2016-12-13 | 2016-12-14 | 2016-12-15 | 2016-12-16 | 2016-12-17 | 2016-12-18 | 2016-12-19 | 2016-12-20 | 2016-12-21 | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | … | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| "count" | "145063" | 124323.0 | 124247.0 | 124519.0 | 124409.0 | 124404.0 | 124580.0 | 124399.0 | 124769.0 | 124819.0 | 124721.0 | 124538.0 | 124578.0 | 124664.0 | 124923.0 | 124957.0 | 125076.0 | 125015.0 | 124768.0 | 124921.0 | 125084.0 | 125375.0 | 125490.0 | 125482.0 | 125470.0 | 125474.0 | 125198.0 | 125304.0 | 125510.0 | 125716.0 | 125471.0 | 125355.0 | 125423.0 | 125219.0 | 125221.0 | 125468.0 | … | 140691.0 | 141284.0 | 140647.0 | 141084.0 | 140992.0 | 140834.0 | 141003.0 | 140820.0 | 141388.0 | 140773.0 | 140832.0 | 140960.0 | 140933.0 | 141101.0 | 140884.0 | 141438.0 | 141482.0 | 141525.0 | 141261.0 | 140955.0 | 140985.0 | 141497.0 | 141504.0 | 141397.0 | 141411.0 | 141795.0 | 141827.0 | 141210.0 | 141479.0 | 141874.0 | 141319.0 | 141145.0 | 141362.0 | 141241.0 | 141237.0 | 141428.0 | 141598.0 |
| "null_count" | "0" | 20740.0 | 20816.0 | 20544.0 | 20654.0 | 20659.0 | 20483.0 | 20664.0 | 20294.0 | 20244.0 | 20342.0 | 20525.0 | 20485.0 | 20399.0 | 20140.0 | 20106.0 | 19987.0 | 20048.0 | 20295.0 | 20142.0 | 19979.0 | 19688.0 | 19573.0 | 19581.0 | 19593.0 | 19589.0 | 19865.0 | 19759.0 | 19553.0 | 19347.0 | 19592.0 | 19708.0 | 19640.0 | 19844.0 | 19842.0 | 19595.0 | … | 4372.0 | 3779.0 | 4416.0 | 3979.0 | 4071.0 | 4229.0 | 4060.0 | 4243.0 | 3675.0 | 4290.0 | 4231.0 | 4103.0 | 4130.0 | 3962.0 | 4179.0 | 3625.0 | 3581.0 | 3538.0 | 3802.0 | 4108.0 | 4078.0 | 3566.0 | 3559.0 | 3666.0 | 3652.0 | 3268.0 | 3236.0 | 3853.0 | 3584.0 | 3189.0 | 3744.0 | 3918.0 | 3701.0 | 3822.0 | 3826.0 | 3635.0 | 3465.0 |
| "mean" | null | 1195.856567 | 1204.003638 | 1133.675969 | 1170.437324 | 1217.7693 | 1290.27319 | 1239.136754 | 1193.092122 | 1197.991772 | 1189.651197 | 1166.807553 | 1296.184431 | 1300.250971 | 1297.203918 | 1216.332298 | 1229.480756 | 1211.808447 | 1168.70482 | 1221.031628 | 1258.79142 | 1197.782429 | 1191.663854 | 1148.396296 | 1086.95642 | 1067.564045 | 1148.721873 | 1184.4078 | 1164.048076 | 1285.768224 | 1089.086897 | 1050.592501 | 1014.091156 | 1111.009408 | 1097.332205 | 1207.094247 | … | 1326.709107 | 1472.107224 | 1583.903133 | 1460.719082 | 1533.142157 | 1400.355717 | 1389.007191 | 1349.266773 | 1440.036531 | 1525.511597 | 1513.758528 | 1495.34192 | 1444.830664 | 1416.289162 | 1369.501143 | 1360.451215 | 1439.703425 | 1457.877654 | 1393.217357 | 1456.495662 | 1338.356251 | 1265.57185 | 1280.523597 | 1474.682207 | 1561.789931 | 1501.701358 | 1418.06307 | 1394.095503 | 1377.481718 | 1393.099208 | 1523.739922 | 1679.606554 | 1678.30187 | 1633.965605 | 1684.307717 | 1467.943378 | 1478.282137 |
| "std" | null | 72753.518671 | 74215.145424 | 69610.224744 | 72573.513699 | 73796.116656 | 80544.484681 | 75762.876927 | 68200.017009 | 71497.170772 | 72145.361308 | 70073.42218 | 72104.77572 | 73558.374194 | 70477.881612 | 69942.147529 | 70234.431406 | 67824.621598 | 65642.014625 | 67030.497267 | 72734.573261 | 70387.996054 | 69662.39481 | 69870.600105 | 66189.231249 | 68316.154206 | 72463.729508 | 72074.306801 | 71046.723761 | 99553.598227 | 65866.928901 | 63828.276878 | 64364.827426 | 68948.708847 | 67922.592758 | 73395.140765 | … | 70854.090103 | 81131.613335 | 78806.585704 | 75828.8807 | 81176.737177 | 72051.218875 | 75120.631235 | 69489.895525 | 82952.830879 | 80284.264809 | 89876.752877 | 84550.848696 | 81250.098882 | 77861.503263 | 69328.109597 | 80732.983057 | 80818.016903 | 87725.540567 | 83851.132297 | 80683.162031 | 73740.367492 | 69217.836755 | 65046.043865 | 78891.269618 | 89254.658123 | 89647.54663 | 91070.039106 | 85748.800943 | 77327.940225 | 84785.327946 | 87522.104316 | 97945.344124 | 92324.820056 | 91858.307668 | 90142.656814 | 81554.814146 | 88735.672589 |
| "min" | "!vote_en.wikipedia.org_all-access_all-agents" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| "25%" | null | 13.0 | 13.0 | 12.0 | 13.0 | 14.0 | 11.0 | 13.0 | 13.0 | 14.0 | 14.0 | 13.0 | 12.0 | 14.0 | 13.0 | 14.0 | 13.0 | 14.0 | 13.0 | 14.0 | 15.0 | 11.0 | 12.0 | 11.0 | 9.0 | 10.0 | 9.0 | 9.0 | 11.0 | 12.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 | 11.0 | … | 21.0 | 21.0 | 23.0 | 22.0 | 23.0 | 24.0 | 22.0 | 22.0 | 21.0 | 22.0 | 22.0 | 22.0 | 22.0 | 22.0 | 22.0 | 21.0 | 21.0 | 23.0 | 21.0 | 21.0 | 23.0 | 23.0 | 23.0 | 23.0 | 23.0 | 24.0 | 23.0 | 22.0 | 22.0 | 20.0 | 21.0 | 22.0 | 23.0 | 24.0 | 23.0 | 23.0 | 21.0 |
| "50%" | null | 109.0 | 108.0 | 105.0 | 105.0 | 113.0 | 113.0 | 115.0 | 117.0 | 115.0 | 113.0 | 107.0 | 118.0 | 120.0 | 119.0 | 114.0 | 114.0 | 113.0 | 109.0 | 120.0 | 121.0 | 115.0 | 115.0 | 110.0 | 100.0 | 98.0 | 106.0 | 108.0 | 108.0 | 107.0 | 102.0 | 97.0 | 90.0 | 97.0 | 100.0 | 117.0 | … | 151.0 | 161.0 | 181.0 | 161.0 | 160.0 | 157.0 | 153.0 | 154.0 | 157.0 | 176.0 | 163.0 | 160.0 | 158.0 | 156.0 | 153.0 | 156.0 | 172.0 | 162.0 | 159.0 | 156.0 | 150.0 | 146.0 | 150.0 | 166.0 | 161.0 | 155.0 | 151.0 | 149.0 | 143.0 | 132.0 | 145.0 | 160.0 | 162.0 | 163.0 | 160.0 | 154.0 | 136.0 |
| "75%" | null | 524.0 | 519.0 | 504.0 | 487.0 | 540.0 | 555.0 | 551.0 | 554.0 | 549.0 | 545.0 | 516.0 | 582.0 | 590.0 | 583.0 | 559.0 | 560.0 | 545.0 | 516.0 | 576.0 | 592.0 | 567.0 | 562.0 | 531.0 | 482.0 | 461.0 | 510.0 | 525.0 | 515.0 | 508.0 | 493.0 | 464.0 | 422.0 | 473.0 | 492.0 | 576.0 | … | 659.0 | 686.0 | 788.0 | 730.0 | 718.0 | 694.0 | 681.0 | 669.0 | 667.0 | 763.0 | 727.0 | 714.0 | 706.0 | 691.0 | 667.0 | 660.0 | 741.0 | 723.0 | 704.0 | 693.0 | 657.0 | 615.0 | 619.0 | 696.0 | 681.0 | 646.0 | 623.0 | 607.0 | 598.0 | 569.0 | 628.0 | 659.0 | 668.0 | 654.0 | 649.0 | 635.0 | 561.0 |
| "max" | "[Alexandros]_ja.wikipedia.org_mobile-web_all-agents" | 2.0381245e7 | 2.0752194e7 | 1.9573967e7 | 2.0439645e7 | 2.0772109e7 | 2.2544669e7 | 2.1210887e7 | 1.9107911e7 | 1.9993848e7 | 2.0201821e7 | 1.9697811e7 | 2.0232622e7 | 2.0516284e7 | 1.9701242e7 | 1.9563106e7 | 1.9658463e7 | 1.8989969e7 | 1.8364385e7 | 1.8825015e7 | 2.0338627e7 | 1.9740114e7 | 1.9536361e7 | 1.9620144e7 | 1.8639334e7 | 1.9317992e7 | 2.0404878e7 | 2.0155604e7 | 1.9891238e7 | 2.7562967e7 | 1.8497101e7 | 1.7928619e7 | 1.8184589e7 | 1.9406979e7 | 1.8952314e7 | 2.0533464e7 | … | 2.0059135e7 | 2.2898094e7 | 2.2245105e7 | 2.1392793e7 | 2.2782984e7 | 2.0294284e7 | 2.1174392e7 | 1.9551058e7 | 2.2703751e7 | 2.2657555e7 | 2.5201283e7 | 2.3745159e7 | 2.2923017e7 | 2.1902049e7 | 1.9534309e7 | 2.2855008e7 | 2.2928843e7 | 2.47652e7 | 2.3661288e7 | 2.2659908e7 | 2.0834185e7 | 1.9653625e7 | 1.8573504e7 | 2.2423821e7 | 2.5067237e7 | 2.5044262e7 | 2.6283835e7 | 2.4201081e7 | 2.2539246e7 | 2.505662e7 | 2.5865746e7 | 2.8342885e7 | 2.6916991e7 | 2.7025053e7 | 2.6073819e7 | 2.4363967e7 | 2.6149541e7 |
df = df.unique()  # drop exact duplicate rows
Null Value Check
df.null_count().transpose(include_header=True)
| column | column_0 |
|---|---|
| str | u32 |
| "Page" | 0 |
| "2015-07-01" | 20740 |
| "2015-07-02" | 20816 |
| "2015-07-03" | 20544 |
| "2015-07-04" | 20654 |
| … | … |
| "2016-12-27" | 3701 |
| "2016-12-28" | 3822 |
| "2016-12-29" | 3826 |
| "2016-12-30" | 3635 |
| "2016-12-31" | 3465 |
Observations
- The earliest date columns have many null values, and the null count steadily decreases toward the most recent dates; this suggests many pages were created partway through the 550-day window
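This pattern is consistent with pages that did not yet exist at the start of the window: a page's series is null up to its creation date. A toy illustration (hypothetical values, not from the dataset):

```python
# Views are None before the page is created, numeric afterwards.
series = [None, None, None, 5, 8, 7, 12, 9]

# Index of the first non-null observation, i.e. the apparent creation date.
first_obs = next(i for i, v in enumerate(series) if v is not None)
print(first_obs)  # 3
```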
Data Cleaning¶
df = df.fill_nan(0)
df = df.fill_null(0)  # treat missing days (e.g. before a page existed) as zero views
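Filling nulls with 0 treats pre-creation days as zero traffic, which is a modeling choice worth keeping in mind. In plain Python, the same fill looks like this (toy values):

```python
series = [None, None, 5, 8, None, 12]  # hypothetical daily views with gaps
filled = [0 if v is None else v for v in series]
print(filled)  # [0, 0, 5, 8, 0, 12]
```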
df.select("Page").sample(10, seed=random.randint(0,1000))
| Page |
|---|
| str |
| "Jamie_Dornan_de.wikipedia.org_desktop_all-agents" |
| "Große_Winkelspinne_de.wikipedia.org_all-access_spider" |
| "兵家綺_zh.wikipedia.org_mobile-web_all-agents" |
| "Rally_Dakar_es.wikipedia.org_mobile-web_all-agents" |
| "Дэдпул_ru.wikipedia.org_all-access_spider" |
| "How_to_contribute/wuu_www.mediawiki.org_desktop_all-agents" |
| "高梁碧_ja.wikipedia.org_all-access_all-agents" |
| "魏如昀_zh.wikipedia.org_desktop_all-agents" |
| "Lion_(2016_film)_en.wikipedia.org_mobile-web_all-agents" |
| "Massaker_von_Srebrenica_de.wikipedia.org_all-access_all-agents" |
db.sql("""
select Page from df where Page like '%_ja.%'
""").pl().sample(5)
| Page |
|---|
| str |
| "熊切あさ美_ja.wikipedia.org_mobile-web_all-agents" |
| "吉田真由子_ja.wikipedia.org_mobile-web_all-agents" |
| "玉木宏_ja.wikipedia.org_all-access_spider" |
| "椿隆之_ja.wikipedia.org_all-access_all-agents" |
| "岡田有希子_ja.wikipedia.org_all-access_spider" |
# 'Manual:Installation_guide/en-gb_www.mediawiki.org_mobile-web_all-agents',
extracted_df = df.select(
pl.col("Page").str.extract(r"/(\w{2})_www.mediawiki.org", group_index=1).alias("mediawiki_lang"),
pl.col("Page").str.extract(r"_(\w{2})\.wikipedia", group_index=1).alias("wikipedia_lang"),
pl.col("Page").str.extract(r"/(\w{2}-\w{2})_www.mediawiki.org", group_index=1).alias("separator_lang"),
)
extracted_df=extracted_df.with_columns(
pl.concat_str([pl.col("mediawiki_lang"), pl.col("wikipedia_lang"), pl.col("separator_lang")], separator="", ignore_nulls=True).alias("lang")
)
extracted_df
| mediawiki_lang | wikipedia_lang | separator_lang | lang |
|---|---|---|---|
| str | str | str | str |
| null | "fr" | null | "fr" |
| null | "ja" | null | "ja" |
| null | null | null | "" |
| null | "en" | null | "en" |
| null | null | null | "" |
| … | … | … | … |
| null | "en" | null | "en" |
| null | null | null | "" |
| null | "zh" | null | "zh" |
| null | "es" | null | "es" |
| null | "de" | null | "de" |
extracted_df.group_by("lang").len().sort("len", descending=True).limit(10)
| lang | len |
|---|---|
| str | u32 |
| "en" | 24120 |
| "ja" | 20515 |
| "de" | 18692 |
| "fr" | 17897 |
| "zh" | 17330 |
| "" | 15775 |
| "ru" | 15131 |
| "es" | 14172 |
| "ar" | 70 |
| "pt-br" | 66 |
df=df.with_columns(
pl.col("Page").str.split("_").list[-1].alias("Agent"),
pl.col("Page").str.split("_").list[-2].alias("Access"),
)
df = df.hstack(extracted_df.select("lang"))
df = df.with_columns(
    pl.col("Page").str.replace_all(r"_all-access|_all-agents|_desktop|_spider|_mobile-web", "").alias("Page"),
)
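Each page name encodes title, project, access type, and agent, separated by underscores; since the title itself may contain underscores, taking the last two components works. A minimal sketch of the split logic:

```python
page = "2NE1_zh.wikipedia.org_all-access_spider"  # example page name from the dataset

parts = page.split("_")
agent, access = parts[-1], parts[-2]
title = "_".join(parts[:-2])  # everything before the last two components

print(title, access, agent)  # 2NE1_zh.wikipedia.org all-access spider
```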
df.select("Page", "lang").filter(pl.col("lang")=="").sample(10, seed=random.randint(0,1000))
| Page | lang |
|---|---|
| str | str |
| "File:Decathlon_Logo.png_commons.wikimedia.org" | "" |
| "File:CharlesManson2014.jpg_commons.wikimedia.org" | "" |
| "Category:Chikki_Panday_commons.wikimedia.org" | "" |
| "UploadWizard_www.mediawiki.org" | "" |
| "File:Brickwork_for_wall_construction.jpg_commons.wikimedia.org" | "" |
| "Topic:T9a0kjggitqfxog0_www.mediawiki.org" | "" |
| "Commons:Wiki_Loves_Earth_2016_in_Kosovo_commons.wikimedia.org" | "" |
| "Category:Doggy_style_in_art_commons.wikimedia.org" | "" |
| "Extension:Lucene-search_www.mediawiki.org" | "" |
| "Special:MyLanguage/Manual:Configuration_settings_www.mediawiki.org" | "" |
df=df.with_columns(
pl.when(pl.col("lang") == "").then(pl.lit("other")).otherwise(pl.col("lang")).alias("lang")
)
df.group_by("lang").len().sort("len", descending=True).limit(10)
| lang | len |
|---|---|
| str | u32 |
| "en" | 24120 |
| "ja" | 20515 |
| "de" | 18692 |
| "fr" | 17897 |
| "zh" | 17330 |
| "other" | 15775 |
| "ru" | 15131 |
| "es" | 14172 |
| "ar" | 70 |
| "pt-br" | 66 |
df = df.with_columns(
pl.col("lang").count().over("lang").alias("lang_count")
)
df = df.with_columns(
pl.when(pl.col("lang_count") < 1000).then(pl.lit("other")).otherwise(pl.col("lang")).alias("lang")
)
df=df.drop("lang_count")
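The window expression above counts pages per language and folds rare languages (fewer than 1,000 pages, e.g. "ar" and "pt-br") into "other". The same logic in plain Python, on a toy list with a smaller threshold:

```python
from collections import Counter

langs = ["en"] * 5 + ["ja"] * 4 + ["ar"]  # hypothetical page languages
counts = Counter(langs)

threshold = 3  # the notebook uses 1000
bucketed = [l if counts[l] >= threshold else "other" for l in langs]

print(Counter(bucketed))  # Counter({'en': 5, 'ja': 4, 'other': 1})
```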
df.null_count().transpose(include_header=True)
| column | column_0 |
|---|---|
| str | u32 |
| "Page" | 0 |
| "2015-07-01" | 0 |
| "2015-07-02" | 0 |
| "2015-07-03" | 0 |
| "2015-07-04" | 0 |
| … | … |
| "2016-12-30" | 0 |
| "2016-12-31" | 0 |
| "Agent" | 0 |
| "Access" | 0 |
| "lang" | 0 |
# df=df.cast({pl.Int64: pl.Int32})
lang_df
| code | lang |
|---|---|
| str | str |
| "ce" | "Chechen" |
| "ml" | "Malayalam" |
| "vi" | "Vietnamese" |
| "be" | "Belarusian" |
| "os" | "Ossetian" |
| … | … |
| "st" | "Sesotho" |
| "lt" | "Lithuanian" |
| "ku" | "Kurdish" |
| "nd" | "North Ndebele" |
| "ik" | "Inupiak" |
df = db.sql("""
    select df.*, ifnull(lang_df.lang, 'Other') as lan
    from df left join lang_df on df.lang = lang_df.code
""").pl().drop("lang").rename({"lan": "lang"})
df.write_parquet("../data/cleaned/df.parquet")
Preprocessing the data¶
cleaned_df=pl.read_parquet("../data/cleaned/df.parquet")
cleaned_df.head(10)
| Page | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | 2015-07-10 | 2015-07-11 | 2015-07-12 | 2015-07-13 | 2015-07-14 | 2015-07-15 | 2015-07-16 | 2015-07-17 | 2015-07-18 | 2015-07-19 | 2015-07-20 | 2015-07-21 | 2015-07-22 | 2015-07-23 | 2015-07-24 | 2015-07-25 | 2015-07-26 | 2015-07-27 | 2015-07-28 | 2015-07-29 | 2015-07-30 | 2015-07-31 | 2015-08-01 | 2015-08-02 | 2015-08-03 | 2015-08-04 | 2015-08-05 | … | 2016-11-28 | 2016-11-29 | 2016-11-30 | 2016-12-01 | 2016-12-02 | 2016-12-03 | 2016-12-04 | 2016-12-05 | 2016-12-06 | 2016-12-07 | 2016-12-08 | 2016-12-09 | 2016-12-10 | 2016-12-11 | 2016-12-12 | 2016-12-13 | 2016-12-14 | 2016-12-15 | 2016-12-16 | 2016-12-17 | 2016-12-18 | 2016-12-19 | 2016-12-20 | 2016-12-21 | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 | Agent | Access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | … | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | str | str |
| "Jamie_Foxx_en.wikipedia.org" | 9819 | 5274 | 5653 | 5571 | 5695 | 5253 | 5197 | 4940 | 4830 | 4975 | 4803 | 6537 | 5744 | 4540 | 4587 | 4866 | 4927 | 5094 | 5533 | 4723 | 4635 | 4660 | 4623 | 4641 | 5217 | 5648 | 4997 | 4946 | 4658 | 4861 | 5154 | 4755 | 5892 | 10003 | 5364 | 4518 | … | 3850 | 3703 | 3420 | 5007 | 4463 | 5098 | 5407 | 4574 | 4208 | 4341 | 4428 | 4968 | 4493 | 4370 | 4019 | 6256 | 5199 | 4483 | 4687 | 5091 | 5527 | 7372 | 6168 | 4999 | 5358 | 4178 | 4418 | 5204 | 6148 | 5928 | 6153 | 8182 | 6600 | 5621 | "all-agents" | "all-access" | "English" |
| "Ryan_Phillippe_fr.wikipedia.org" | 5 | 4 | 6 | 3 | 6 | 1 | 1 | 5 | 6 | 3 | 2 | 4 | 5 | 3 | 1 | 1 | 3 | 8 | 4 | 8 | 0 | 2 | 4 | 2 | 4 | 2 | 1 | 5 | 3 | 2 | 2 | 3 | 1 | 0 | 2 | 2 | … | 8 | 34 | 10 | 4 | 8 | 8 | 4 | 2 | 13 | 6 | 3 | 4 | 7 | 5 | 4 | 3 | 1 | 5 | 5 | 5 | 0 | 4 | 2 | 3 | 0 | 4 | 4 | 2 | 11 | 11 | 3 | 4 | 13 | 6 | "spider" | "all-access" | "French" |
| "徐朱玄_zh.wikipedia.org" | 336 | 317 | 333 | 333 | 465 | 988 | 736 | 736 | 719 | 579 | 611 | 495 | 410 | 355 | 361 | 385 | 412 | 363 | 355 | 341 | 450 | 377 | 371 | 302 | 282 | 265 | 302 | 288 | 314 | 279 | 292 | 248 | 259 | 293 | 240 | 265 | … | 166 | 202 | 150 | 158 | 151 | 230 | 212 | 178 | 175 | 140 | 164 | 177 | 189 | 186 | 173 | 141 | 155 | 141 | 121 | 150 | 172 | 122 | 136 | 151 | 132 | 170 | 164 | 188 | 156 | 195 | 199 | 217 | 197 | 172 | "all-agents" | "desktop" | "Chinese" |
| "血型_zh.wikipedia.org" | 337 | 291 | 300 | 249 | 251 | 243 | 254 | 224 | 822 | 338 | 260 | 261 | 205 | 841 | 395 | 298 | 281 | 317 | 293 | 235 | 278 | 277 | 303 | 214 | 243 | 265 | 235 | 235 | 195 | 188 | 186 | 174 | 198 | 204 | 219 | 379 | … | 297 | 252 | 275 | 292 | 244 | 296 | 291 | 330 | 391 | 343 | 360 | 305 | 408 | 315 | 287 | 311 | 342 | 291 | 322 | 310 | 313 | 305 | 301 | 285 | 349 | 377 | 366 | 258 | 264 | 258 | 341 | 289 | 250 | 255 | "all-agents" | "mobile-web" | "Chinese" |
| "Christina_Grimmie_en.wikipedia.org" | 1067 | 1338 | 1187 | 1107 | 1153 | 1166 | 1127 | 1139 | 1122 | 1131 | 1066 | 1101 | 1103 | 1119 | 1044 | 1170 | 1161 | 1217 | 1236 | 1171 | 1105 | 1130 | 1269 | 1250 | 1162 | 1190 | 1331 | 1250 | 1464 | 1203 | 1252 | 1099 | 1215 | 1049 | 1205 | 1266 | … | 4415 | 4620 | 4715 | 4876 | 5427 | 6125 | 5638 | 4731 | 8408 | 10580 | 10414 | 8329 | 7713 | 8076 | 7358 | 6815 | 7754 | 10786 | 11808 | 10147 | 6885 | 7106 | 6569 | 10555 | 13525 | 9357 | 7168 | 6370 | 8675 | 10827 | 20408 | 25047 | 22465 | 21219 | "all-agents" | "all-access" | "English" |
| "Adele_en.wikipedia.org" | 97 | 63 | 85 | 66 | 79 | 48 | 77 | 58 | 85 | 73 | 98 | 49 | 40 | 52 | 51 | 80 | 172 | 128 | 76 | 60 | 129 | 61 | 56 | 78 | 177 | 25 | 76 | 58 | 59 | 32 | 31 | 44 | 19 | 62 | 32 | 33 | … | 161 | 171 | 171 | 173 | 207 | 239 | 142 | 167 | 170 | 220 | 269 | 445 | 312 | 192 | 210 | 201 | 332 | 157 | 202 | 220 | 153 | 214 | 261 | 299 | 150 | 194 | 211 | 182 | 209 | 322 | 688 | 259 | 161 | 171 | "spider" | "all-access" | "English" |
| "Élodie_Fontan_fr.wikipedia.org" | 2 | 2 | 6 | 5 | 3 | 7 | 1 | 5 | 5 | 3 | 5 | 4 | 6 | 6 | 2 | 3 | 2 | 6 | 4 | 4 | 6 | 6 | 1 | 4 | 6 | 3 | 4 | 0 | 5 | 1 | 4 | 0 | 2 | 3 | 3 | 5 | … | 3 | 13 | 4 | 2 | 7 | 4 | 6 | 2 | 8 | 1 | 3 | 3 | 10 | 2 | 5 | 2 | 2 | 8 | 5 | 7 | 6 | 6 | 9 | 6 | 5 | 1 | 37 | 14 | 2 | 5 | 12 | 13 | 12 | 8 | "spider" | "all-access" | "French" |
| "Loi_de_Murphy_fr.wikipedia.org" | 1066 | 1148 | 1028 | 840 | 877 | 959 | 971 | 1083 | 1164 | 1007 | 784 | 889 | 860 | 826 | 1001 | 1046 | 974 | 833 | 907 | 1069 | 1119 | 1093 | 997 | 994 | 882 | 899 | 1043 | 1019 | 1066 | 946 | 1002 | 896 | 806 | 762 | 1042 | 2256 | … | 1247 | 954 | 1021 | 890 | 791 | 729 | 718 | 1896 | 1269 | 1183 | 920 | 859 | 704 | 902 | 912 | 997 | 966 | 794 | 904 | 658 | 640 | 689 | 744 | 785 | 745 | 783 | 787 | 758 | 773 | 786 | 1191 | 1167 | 854 | 631 | "all-agents" | "all-access" | "French" |
| "おのののか_ja.wikipedia.org" | 1810 | 1879 | 3057 | 1963 | 1701 | 1620 | 919 | 1201 | 1328 | 769 | 2421 | 1503 | 817 | 1331 | 790 | 1426 | 1444 | 1636 | 39250 | 15692 | 7354 | 5008 | 4234 | 3985 | 3704 | 2730 | 2443 | 5165 | 2308 | 3278 | 2158 | 1723 | 3105 | 1864 | 2678 | 3286 | … | 578 | 826 | 673 | 529 | 473 | 1310 | 1010 | 475 | 424 | 391 | 552 | 581 | 925 | 709 | 698 | 1138 | 616 | 587 | 394 | 691 | 1571 | 466 | 407 | 333 | 373 | 430 | 879 | 1567 | 519 | 4327 | 1226 | 1314 | 757 | 1715 | "all-agents" | "mobile-web" | "Japanese" |
| "Silvestre_(cantante)_es.wikipedia.org" | 14 | 10 | 8 | 9 | 11 | 19 | 12 | 11 | 24 | 15 | 18 | 14 | 17 | 18 | 11 | 16 | 40 | 21 | 24 | 19 | 16 | 15 | 19 | 16 | 15 | 12 | 20 | 14 | 23 | 20 | 19 | 22 | 26 | 14 | 21 | 15 | … | 17 | 16 | 18 | 14 | 14 | 12 | 10 | 11 | 17 | 14 | 11 | 23 | 15 | 7409 | 5467 | 2617 | 3477 | 2626 | 1160 | 404 | 863 | 552 | 197 | 132 | 155 | 67 | 50 | 37 | 113 | 70 | 35 | 25 | 32 | 36 | "all-agents" | "desktop" | "Spanish" |
page_view_count = cleaned_df.select(pl.exclude("Page", "lang", "Agent", "Access")).sum_horizontal().to_frame("page_view_count")
cleaned_df = cleaned_df.hstack(page_view_count)
page_view_count
| page_view_count |
|---|
| i64 |
| 3192217 |
| 2694 |
| 167814 |
| 190388 |
| 6054032 |
| … |
| 1678 |
| 26091 |
| 11979 |
| 818 |
| 9236 |
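`sum_horizontal` totals each page's views across all date columns. A minimal sketch with made-up rows:

```python
# Each inner list holds one page's daily view counts (hypothetical values).
rows = [
    [18, 11, 5],
    [11, 14, 15],
]
page_view_count = [sum(r) for r in rows]
print(page_view_count)  # [34, 40]
```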
fig, ax = plt.subplots(1,2, figsize=(15, 6))
cleaned_df.group_by("lang").len().sort("len", descending=True).to_pandas().plot.bar(x="lang", y="len", ax=ax[0])
ax[0].set_title("Number of pages per language");
cleaned_df.group_by("lang").agg(pl.sum("page_view_count")).sort("page_view_count", descending=True).to_pandas().plot.bar(x="lang", y="page_view_count", ax=ax[1])
ax[1].set_title("Total page views per language");
Observations
- From the plot above we can see that English has the highest number of pages
- Correspondingly, English also has the highest total number of views
fig, ax = plt.subplots(1,2, figsize=(15, 6))
cleaned_df.group_by("Access").agg(pl.sum("page_view_count")).sort("page_view_count", descending=True).to_pandas().plot.bar(x="Access", y="page_view_count", ax=ax[0])
ax[0].set_title("Total page views per Access");
cleaned_df.group_by("Agent").agg(pl.sum("page_view_count")).sort("page_view_count", descending=True).to_pandas().plot.bar(x="Agent", y="page_view_count", ax=ax[1])
ax[1].set_title("Total page views per Agent");
Observations
- From the plot above we can see that most views come from the "all-access" access type and the "all-agents" agent type
cleaned_df.group_by("Page").agg(pl.sum("page_view_count")).sort("page_view_count", descending=True).limit(10).to_pandas().plot.barh(x="Page", y="page_view_count")
<Axes: ylabel='Page'>
Observations
- Above are the top 10 pages by total number of views
- The English Main Page has the highest number of views
db.sql("""
with cte as (
select
Page,
lang,
sum(page_view_count) as page_view_count,
rank() over (partition by lang order by sum(page_view_count) desc) as rank
from cleaned_df
group by Page,
lang
)
select * from cte where rank = 1 order by page_view_count desc
""").pl()
| Page | lang | page_view_count | rank |
|---|---|---|---|
| str | str | decimal[38,0] | i64 |
| "Main_Page_en.wikipedia.org" | "English" | 24123683849 | 1 |
| "Wikipedia:Hauptseite_de.wikipedia.org" | "German" | 3167931410 | 1 |
| "Заглавная_страница_ru.wikipedia.org" | "Russian" | 2164257149 | 1 |
| "Wikipédia:Accueil_principal_fr.wikipedia.org" | "French" | 1783233299 | 1 |
| "Wikipedia:Portada_es.wikipedia.org" | "Spanish" | 1513624439 | 1 |
| "メインページ_ja.wikipedia.org" | "Japanese" | 417501194 | 1 |
| "Wikipedia:首页_zh.wikipedia.org" | "Chinese" | 245697901 | 1 |
| "Special:Search_commons.wikimedia.org" | "Other" | 134968447 | 1 |
Observations
- Above is the list of top viewed pages in each language
lang_grp = cleaned_df.group_by("lang").sum().drop("Agent", "Access", "page_view_count", "Page")
lang_grp
| lang | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | 2015-07-10 | 2015-07-11 | 2015-07-12 | 2015-07-13 | 2015-07-14 | 2015-07-15 | 2015-07-16 | 2015-07-17 | 2015-07-18 | 2015-07-19 | 2015-07-20 | 2015-07-21 | 2015-07-22 | 2015-07-23 | 2015-07-24 | 2015-07-25 | 2015-07-26 | 2015-07-27 | 2015-07-28 | 2015-07-29 | 2015-07-30 | 2015-07-31 | 2015-08-01 | 2015-08-02 | 2015-08-03 | 2015-08-04 | 2015-08-05 | … | 2016-11-25 | 2016-11-26 | 2016-11-27 | 2016-11-28 | 2016-11-29 | 2016-11-30 | 2016-12-01 | 2016-12-02 | 2016-12-03 | 2016-12-04 | 2016-12-05 | 2016-12-06 | 2016-12-07 | 2016-12-08 | 2016-12-09 | 2016-12-10 | 2016-12-11 | 2016-12-12 | 2016-12-13 | 2016-12-14 | 2016-12-15 | 2016-12-16 | 2016-12-17 | 2016-12-18 | 2016-12-19 | 2016-12-20 | 2016-12-21 | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | … | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| "Spanish" | 15279998 | 14602449 | 13428998 | 12607803 | 13711974 | 15627539 | 15232641 | 14783419 | 14504467 | 13185904 | 11858311 | 13510484 | 15327578 | 16096772 | 15212758 | 14485800 | 13036926 | 11343056 | 12132508 | 14252905 | 14118075 | 14528019 | 13631289 | 12107623 | 10849223 | 11964941 | 13780739 | 13678565 | 13832158 | 13156141 | 11885923 | 10380883 | 11394918 | 11743830 | 14245528 | 14416982 | … | 15764081 | 16627421 | 19190118 | 19453791 | 22936644 | 19855124 | 17733972 | 15845342 | 12804503 | 14932115 | 16799286 | 17859750 | 16370034 | 15832607 | 14290479 | 12123873 | 13478694 | 15310505 | 15285661 | 16108308 | 14209518 | 12937405 | 12171378 | 14251871 | 15949043 | 15488294 | 15145135 | 13839162 | 13157437 | 12386080 | 12714491 | 16820407 | 15946677 | 16578967 | 15648438 | 11562101 | 11079339 |
| "Other" | 1477069 | 1548788 | 1464346 | 1249447 | 1385036 | 1586847 | 1679560 | 1755740 | 1530950 | 1547281 | 1165601 | 1205769 | 1411866 | 1588920 | 1683736 | 1616798 | 1773096 | 1595188 | 1468167 | 1465313 | 1816109 | 1955677 | 1715826 | 1622922 | 1479605 | 1408964 | 1782903 | 1683857 | 1797760 | 1715689 | 2892519 | 1680268 | 3061505 | 1670193 | 1962956 | 2007787 | … | 2041914 | 2158720 | 2142316 | 2520928 | 4157704 | 2448214 | 2377015 | 2323118 | 2623705 | 2827612 | 2822307 | 2646563 | 2606370 | 2542841 | 4827018 | 3734861 | 2859781 | 2870442 | 2476462 | 2624603 | 2541253 | 2441475 | 3081714 | 2989461 | 2613503 | 3027418 | 3680641 | 2336215 | 2931946 | 2491251 | 2913440 | 2947931 | 2615280 | 3316427 | 2620026 | 2921474 | 2559123 |
| "French" | 8460177 | 8514332 | 8187395 | 8750945 | 8591590 | 8951425 | 8652420 | 8493256 | 8405132 | 7932023 | 7860203 | 8976095 | 8753299 | 9387693 | 8662008 | 8470259 | 8153328 | 8252934 | 8947366 | 8726022 | 8498414 | 8551485 | 8055929 | 7541728 | 7557274 | 8199660 | 7949596 | 7851719 | 7964070 | 7603077 | 7142116 | 6916462 | 7058655 | 7200511 | 8653757 | 8349621 | … | 10935645 | 12511342 | 14111910 | 12358937 | 12308122 | 11452556 | 10940544 | 10776821 | 12107205 | 14522990 | 13134474 | 12991885 | 12184132 | 11950583 | 11885906 | 12136360 | 13313653 | 13348336 | 12150770 | 11659149 | 11378131 | 11908293 | 12459646 | 14252394 | 15021807 | 13143504 | 12057963 | 11608728 | 11385252 | 11006683 | 11868613 | 16679879 | 15282895 | 13783146 | 13401250 | 12472948 | 11505867 |
| "English" | 84712323 | 84438676 | 80167807 | 83463351 | 86198774 | 92809569 | 87838163 | 82880435 | 84799344 | 84319604 | 84789202 | 93279023 | 92392052 | 92916294 | 85897549 | 86911252 | 83306009 | 80407700 | 85355858 | 89703719 | 85445610 | 84101862 | 82835476 | 80528880 | 80688971 | 86443149 | 87839162 | 86547291 | 100997970 | 79149920 | 76561432 | 76695026 | 83480508 | 78772328 | 84914381 | 85031471 | … | 107788266 | 121988297 | 127491196 | 116790779 | 120243129 | 112338727 | 111159612 | 109467860 | 123415872 | 124686850 | 127319077 | 126321875 | 119067662 | 116664998 | 108556921 | 113151730 | 117612769 | 121038477 | 115257216 | 121089002 | 110158675 | 103840461 | 103804092 | 120029012 | 130940080 | 128043995 | 118888302 | 120491905 | 112905519 | 119861247 | 123926452 | 139112369 | 145628953 | 141278597 | 150557723 | 125404796 | 123624015 |
| "Chinese" | 4147071 | 4152793 | 4125028 | 4164784 | 4442807 | 4466109 | 4461263 | 4577333 | 4549099 | 4729303 | 4412580 | 4971098 | 4562122 | 4309581 | 4226023 | 4215704 | 4233702 | 4411839 | 4973591 | 4678115 | 4602217 | 4659355 | 4146985 | 3601962 | 3465035 | 3740949 | 3741118 | 3693489 | 3730581 | 3586615 | 3379722 | 3072096 | 3374013 | 3743645 | 4413556 | 4576750 | … | 6032057 | 7612803 | 7569312 | 6196755 | 6315070 | 6150957 | 5978330 | 6041775 | 6839544 | 7111855 | 6349308 | 6166004 | 5994778 | 5773036 | 6099685 | 6548693 | 7416420 | 6395366 | 6185631 | 6041202 | 5841504 | 5599683 | 6211411 | 6885894 | 6197343 | 6251170 | 6429761 | 5948263 | 5867237 | 6216614 | 6571114 | 6487253 | 6480128 | 6515403 | 6044182 | 6114003 | 6300124 |
| "Japanese" | 11864936 | 13622479 | 12307000 | 15457627 | 14828618 | 12922630 | 12570816 | 12494311 | 12179762 | 12654331 | 13068243 | 14059069 | 12517398 | 11727093 | 11752028 | 13443986 | 16093387 | 13532831 | 13269906 | 13273637 | 11599116 | 11638257 | 11057718 | 10060492 | 9804823 | 10497848 | 10145374 | 9917127 | 10798926 | 9743033 | 9130774 | 8862972 | 9887186 | 10682450 | 12344808 | 12153812 | … | 15376787 | 15989146 | 17342398 | 14386738 | 16345439 | 13601936 | 16271026 | 15232048 | 15870467 | 17509081 | 15325843 | 14397142 | 17702667 | 15595682 | 16881040 | 16319927 | 17469249 | 15946503 | 15250668 | 16605887 | 13999336 | 13786593 | 15686739 | 19008466 | 15993510 | 16059254 | 14547788 | 13794046 | 19778445 | 17502241 | 16721058 | 15919063 | 16124459 | 16152145 | 17683587 | 19452390 | 24461932 |
| "Russian" | 9466267 | 9630020 | 8925576 | 8395377 | 8940768 | 9631695 | 9411165 | 9366525 | 9594537 | 10987503 | 10115500 | 11188984 | 11540937 | 11187016 | 10706689 | 10918405 | 10782009 | 9666220 | 10706183 | 10985704 | 10691482 | 10435144 | 9378083 | 8321106 | 7587437 | 8477490 | 9162428 | 8828000 | 8655581 | 8443422 | 8293395 | 7428512 | 8290129 | 10521832 | 11068549 | 10585083 | … | 13585166 | 15562912 | 16532952 | 16003385 | 16646300 | 15718372 | 15934987 | 15005949 | 15406632 | 16114535 | 14812896 | 14461251 | 13869482 | 14229316 | 13818039 | 14327809 | 14681438 | 14534200 | 14291605 | 14287936 | 14827122 | 13838997 | 13956864 | 14981542 | 14423262 | 14802559 | 14323495 | 13466853 | 13294091 | 13134958 | 16841275 | 16718751 | 15041813 | 14002376 | 13480581 | 12069020 | 13224530 |
| "German" | 13264635 | 13084303 | 12558048 | 11523603 | 13395805 | 14746420 | 14301345 | 14509892 | 13968844 | 13018538 | 12042239 | 14285542 | 15589235 | 14837236 | 13848444 | 13716331 | 14115776 | 16607195 | 15678913 | 14369251 | 13400949 | 13672098 | 13281758 | 12595709 | 12519163 | 13084680 | 14009715 | 13899626 | 13864592 | 13250925 | 12411142 | 12154136 | 12572573 | 13074247 | 13848166 | 13260306 | … | 15132115 | 15534556 | 18391022 | 18372778 | 17208371 | 15651811 | 15458695 | 15310834 | 14535957 | 17045806 | 16622450 | 15938927 | 15829195 | 17250754 | 16581711 | 14076246 | 16860116 | 16882306 | 15909264 | 16884259 | 15732617 | 14721713 | 13827367 | 16117000 | 19715728 | 16117550 | 16046546 | 15375054 | 15564809 | 15045483 | 23776959 | 22382414 | 20127904 | 19155875 | 18450782 | 17611564 | 16566864 |
df=lang_grp.unpivot(index="lang", variable_name="dates").pivot("lang", index="dates")
df=df.with_columns(
pl.col("dates").cast(pl.Date),
)
campaign_df = pl.read_csv("../data/processed/Exog_Campaign_eng.csv", has_header=True, ignore_errors=True)
campaign_df
| Exog |
|---|
| i64 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| … |
| 1 |
| 1 |
| 1 |
| 0 |
| 0 |
df=df.hstack(campaign_df)
df.head(10)
| dates | Spanish | Other | French | English | Chinese | Japanese | Russian | German | Exog |
|---|---|---|---|---|---|---|---|---|---|
| date | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 2015-07-01 | 15279998 | 1477069 | 8460177 | 84712323 | 4147071 | 11864936 | 9466267 | 13264635 | 0 |
| 2015-07-02 | 14602449 | 1548788 | 8514332 | 84438676 | 4152793 | 13622479 | 9630020 | 13084303 | 0 |
| 2015-07-03 | 13428998 | 1464346 | 8187395 | 80167807 | 4125028 | 12307000 | 8925576 | 12558048 | 0 |
| 2015-07-04 | 12607803 | 1249447 | 8750945 | 83463351 | 4164784 | 15457627 | 8395377 | 11523603 | 0 |
| 2015-07-05 | 13711974 | 1385036 | 8591590 | 86198774 | 4442807 | 14828618 | 8940768 | 13395805 | 0 |
| 2015-07-06 | 15627539 | 1586847 | 8951425 | 92809569 | 4466109 | 12922630 | 9631695 | 14746420 | 0 |
| 2015-07-07 | 15232641 | 1679560 | 8652420 | 87838163 | 4461263 | 12570816 | 9411165 | 14301345 | 0 |
| 2015-07-08 | 14783419 | 1755740 | 8493256 | 82880435 | 4577333 | 12494311 | 9366525 | 14509892 | 0 |
| 2015-07-09 | 14504467 | 1530950 | 8405132 | 84799344 | 4549099 | 12179762 | 9594537 | 13968844 | 0 |
| 2015-07-10 | 13185904 | 1547281 | 7932023 | 84319604 | 4729303 | 12654331 | 10987503 | 13018538 | 0 |
df = df.with_columns(
pl.col("dates").dt.strftime("%A").alias("weekday"),
pl.col("dates").dt.strftime("%B").alias("month"),
pl.col("dates").dt.year().alias("year"),
)
fig, ax = plt.subplots(4, 2, figsize=(20, 25))
for i, col in enumerate(columns):
df.group_by("weekday").agg(pl.sum(col)).to_pandas().plot.bar(x="weekday", y=col, ax=ax[i//2, i%2])
ax[i//2, i%2].set_title(f"{col} page views per weekday")
plt.tight_layout()
Observations
- The above plot shows total page views per weekday for each language
fig, ax = plt.subplots(4, 2, figsize=(20, 25))
for i, col in enumerate(columns):
df.group_by("month").agg(pl.sum(col)).to_pandas().plot.bar(x="month", y=col, ax=ax[i//2, i%2])
ax[i//2, i%2].set_title(f"{col} page views per month")
plt.tight_layout()
Observations
- The above plot shows total page views per month for each language
- For Spanish, October, September, and November have the highest number of views
df.head(10)
| dates | Spanish | Other | French | English | Chinese | Japanese | Russian | German | Exog | weekday | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | i8 | i32 |
| 2015-07-01 | 15279998 | 1477069 | 8460177 | 84712323 | 4147071 | 11864936 | 9466267 | 13264635 | 0 | "Wednesday" | 7 | 2015 |
| 2015-07-02 | 14602449 | 1548788 | 8514332 | 84438676 | 4152793 | 13622479 | 9630020 | 13084303 | 0 | "Thursday" | 7 | 2015 |
| 2015-07-03 | 13428998 | 1464346 | 8187395 | 80167807 | 4125028 | 12307000 | 8925576 | 12558048 | 0 | "Friday" | 7 | 2015 |
| 2015-07-04 | 12607803 | 1249447 | 8750945 | 83463351 | 4164784 | 15457627 | 8395377 | 11523603 | 0 | "Saturday" | 7 | 2015 |
| 2015-07-05 | 13711974 | 1385036 | 8591590 | 86198774 | 4442807 | 14828618 | 8940768 | 13395805 | 0 | "Sunday" | 7 | 2015 |
| 2015-07-06 | 15627539 | 1586847 | 8951425 | 92809569 | 4466109 | 12922630 | 9631695 | 14746420 | 0 | "Monday" | 7 | 2015 |
| 2015-07-07 | 15232641 | 1679560 | 8652420 | 87838163 | 4461263 | 12570816 | 9411165 | 14301345 | 0 | "Tuesday" | 7 | 2015 |
| 2015-07-08 | 14783419 | 1755740 | 8493256 | 82880435 | 4577333 | 12494311 | 9366525 | 14509892 | 0 | "Wednesday" | 7 | 2015 |
| 2015-07-09 | 14504467 | 1530950 | 8405132 | 84799344 | 4549099 | 12179762 | 9594537 | 13968844 | 0 | "Thursday" | 7 | 2015 |
| 2015-07-10 | 13185904 | 1547281 | 7932023 | 84319604 | 4729303 | 12654331 | 10987503 | 13018538 | 0 | "Friday" | 7 | 2015 |
df.write_parquet("../data/processed/df.parquet")
Time Series Analysis¶
df=pl.read_parquet("../data/processed/df.parquet").to_pandas()
df=df.set_index("dates")
df=df.drop("Other", axis=1)
df
| Spanish | Japanese | English | Chinese | German | Russian | French | Exog | |
|---|---|---|---|---|---|---|---|---|
| dates | ||||||||
| 2015-07-01 | 15279998 | 11864936 | 84712323 | 4147071 | 13264635 | 9466267 | 8460177 | 0 |
| 2015-07-02 | 14602449 | 13622479 | 84438676 | 4152793 | 13084303 | 9630020 | 8514332 | 0 |
| 2015-07-03 | 13428998 | 12307000 | 80167807 | 4125028 | 12558048 | 8925576 | 8187395 | 0 |
| 2015-07-04 | 12607803 | 15457627 | 83463351 | 4164784 | 11523603 | 8395377 | 8750945 | 0 |
| 2015-07-05 | 13711974 | 14828618 | 86198774 | 4442807 | 13395805 | 8940768 | 8591590 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-12-27 | 15946677 | 16124459 | 145628953 | 6480128 | 20127904 | 15041813 | 15282895 | 1 |
| 2016-12-28 | 16578967 | 16152145 | 141278597 | 6515403 | 19155875 | 14002376 | 13783146 | 1 |
| 2016-12-29 | 15648438 | 17683587 | 150557723 | 6044182 | 18450782 | 13480581 | 13401250 | 1 |
| 2016-12-30 | 11562101 | 19452390 | 125404796 | 6114003 | 17611564 | 12069020 | 12472948 | 0 |
| 2016-12-31 | 11079339 | 24461932 | 123624015 | 6300124 | 16566864 | 13224530 | 11505867 | 0 |
550 rows × 8 columns
columns = ['Spanish', 'Japanese', 'English', 'Chinese', 'German', 'Russian', 'French']
columns
['Spanish', 'Japanese', 'English', 'Chinese', 'German', 'Russian', 'French']
Combined Time Series Plot
df["Exog_scaled"] = df["Exog"]*df["English"]
filtered_df = df[df["Exog_scaled"] > 0]
df=df.drop("Exog_scaled", axis=1)
df.drop(["Exog"], axis=1).plot(figsize=(20, 8));
plt.scatter(filtered_df.index, filtered_df["Exog_scaled"], color="green", label="Exog_scaled")
plt.title("Page Views by Language");
Observations
- We can see that English has the highest number of page views
- There is a surge in views around August for the English and Russian series
- This is likely due to the campaigns that were run
Individual Time Series Plot
import matplotlib.dates as mdates
fig, ax = plt.subplots(len(columns), 1, figsize=(20, 25))
for i, col in enumerate(columns):
ax[i].plot(df[col])
ax[i].set_title(col, fontsize=16)
ax[i].xaxis.set_major_locator(mdates.MonthLocator())
ax[i].xaxis.set_major_formatter(mdates.DateFormatter('%b'))
plt.setp(ax[i].xaxis.get_majorticklabels(), rotation=45, ha='right')
plt.tight_layout()
plt.show()
Observations
- The above plots show the time series for each language
- We can see that the Spanish series has double seasonality
- There are sudden spikes in the English, French, and Russian views
- These can be due to campaigns that were run
Distribution of view counts
fig, ax = plt.subplots(4, 2, figsize=(20, 15))
for i, col in enumerate(columns):
df[col].hist(ax=ax[i//2, i%2], bins=100)
ax[i//2, i%2].set_title(col, fontsize=16)
# plt.tight_layout()
Observations
- From the above plot, we can see that the Russian series has a large number of outliers
Stationarity Check¶
Augmented Dickey-Fuller (ADF) Test
def is_stationary(series, significance_level=0.05):
adf_test = adfuller(series)
p_value = adf_test[1]
return p_value < significance_level , p_value
adf_df = pd.DataFrame(columns=["Lang", "p-value", "is_stationary"])
for column in columns:
result, p_value = is_stationary(df[column])
adf_df = pd.concat([adf_df, pd.DataFrame({"Lang": [column], "p-value": p_value, "is_stationary": result})], ignore_index=True)
adf_df.sort_values("p-value", ascending=False)
| Lang | p-value | is_stationary | |
|---|---|---|---|
| 3 | Chinese | 0.447261 | False |
| 2 | English | 0.189574 | False |
| 4 | German | 0.139171 | False |
| 1 | Japanese | 0.102580 | False |
| 6 | French | 0.051539 | False |
| 0 | Spanish | 0.033589 | True |
| 5 | Russian | 0.001864 | True |
Observations
- The above table shows which language series are stationary and which are not
- Only the Russian and Spanish series are stationary
Decomposition¶
Decomposition is a useful tool for understanding the components of a time series. Plotting the decomposition separates the trend, the seasonality, and the residuals. The benefits of decomposition are:
- It helps in understanding the trend and seasonality in the data
- It helps in understanding the residuals
In short, decomposition strips away the noise so we can focus on the trend and seasonality.
English
plt.rcParams['figure.figsize'] = (20, 8)
seasonal_decompose(df["English"], model="additive").plot();
Observations
- Trend: There is an upward trend in the page view counts
- Seasonality: The seasonal plot shows a clear presence of seasonality
- Residuals: The residuals are spread around 0, suggesting no remaining trend
Spanish
seasonal_decompose(df["Spanish"], model="additive").plot();
Observations
- Trend: There is no clear trend in the page view counts
- Seasonality: The seasonal plot shows a clear presence of seasonality
- Residuals: The residuals are spread around 0, suggesting no remaining trend
Russian
seasonal_decompose(df["Russian"], model="additive").plot();
Observations
- Trend: There is a slight trend in the page view counts
- Seasonality: The seasonal plot shows a clear presence of seasonality
- Residuals: The residuals are tightly spread around 0, suggesting no remaining trend
Japanese
seasonal_decompose(df["Japanese"], model="additive").plot();
Observations
- Trend: There was a jump in views, after which the level stays roughly constant
- Seasonality: The seasonal plot shows a clear presence of seasonality
- Residuals: The residuals are evenly spread around 0, suggesting no remaining trend
French
seasonal_decompose(df["French"], model="additive").plot();
Observations
- Trend: There is a slight trend in the page view counts
- Seasonality: The seasonal plot shows a clear presence of seasonality
- Residuals: The residuals are evenly spread around 0, suggesting no remaining trend
German
seasonal_decompose(df["German"], model="additive").plot();
Observations
- Trend: There are ups and downs in the page view counts but no clear trend
- Seasonality: The seasonal plot shows a clear presence of seasonality
- Residuals: The residuals are evenly spread around 0, suggesting no remaining trend
Chinese
seasonal_decompose(df["Chinese"], model="additive").plot();
Observations
- Trend: There was a jump in views, after which the level stays roughly constant
- Seasonality: The seasonal plot shows a clear presence of seasonality
- Residuals: The residuals are evenly spread around 0, suggesting no remaining trend
Removing Non-Stationarity¶
df_st = df.copy()
Single Differencing
df_st[columns] = df_st[columns].diff()
df_st=df_st.dropna()
df_st
| Spanish | Japanese | English | Chinese | German | Russian | French | Exog | |
|---|---|---|---|---|---|---|---|---|
| dates | ||||||||
| 2015-07-02 | -677549.0 | 1757543.0 | -273647.0 | 5722.0 | -180332.0 | 163753.0 | 54155.0 | 0 |
| 2015-07-03 | -1173451.0 | -1315479.0 | -4270869.0 | -27765.0 | -526255.0 | -704444.0 | -326937.0 | 0 |
| 2015-07-04 | -821195.0 | 3150627.0 | 3295544.0 | 39756.0 | -1034445.0 | -530199.0 | 563550.0 | 0 |
| 2015-07-05 | 1104171.0 | -629009.0 | 2735423.0 | 278023.0 | 1872202.0 | 545391.0 | -159355.0 | 0 |
| 2015-07-06 | 1915565.0 | -1905988.0 | 6610795.0 | 23302.0 | 1350615.0 | 690927.0 | 359835.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-12-27 | -873730.0 | 205396.0 | 6516584.0 | -7125.0 | -2254510.0 | -1676938.0 | -1396984.0 | 1 |
| 2016-12-28 | 632290.0 | 27686.0 | -4350356.0 | 35275.0 | -972029.0 | -1039437.0 | -1499749.0 | 1 |
| 2016-12-29 | -930529.0 | 1531442.0 | 9279126.0 | -471221.0 | -705093.0 | -521795.0 | -381896.0 | 1 |
| 2016-12-30 | -4086337.0 | 1768803.0 | -25152927.0 | 69821.0 | -839218.0 | -1411561.0 | -928302.0 | 0 |
| 2016-12-31 | -482762.0 | 5009542.0 | -1780781.0 | 186121.0 | -1044700.0 | 1155510.0 | -967081.0 | 0 |
549 rows × 8 columns
adf_df = pd.DataFrame(columns=["Lang", "p-value", "is_stationary"])
for column in columns:
result, p_value = is_stationary(df_st[column])
adf_df = pd.concat([adf_df, pd.DataFrame({"Lang": [column], "p-value": p_value, "is_stationary": result})], ignore_index=True)
adf_df.sort_values("p-value", ascending=False)
| Lang | p-value | is_stationary | |
|---|---|---|---|
| 0 | Spanish | 1.499282e-07 | True |
| 4 | German | 2.488111e-10 | True |
| 3 | Chinese | 1.175377e-11 | True |
| 2 | English | 5.285073e-13 | True |
| 6 | French | 1.056177e-17 | True |
| 1 | Japanese | 6.206743e-20 | True |
| 5 | Russian | 2.485269e-30 | True |
Observations
- After applying first-order differencing, all the language series are stationary
ACF and PACF¶
params={}
# params["Spanish"] = {
# "order": (1, 1, 1), p: AR -> PACF , d: Difference, q: MA -> ACF
# "seasonal_order": (1, 1, 1, 12), P: Seasonal AR, D: Seasonal Difference, Q: Seasonal MA, m: Seasonal Period
# }
df_st=pd.read_parquet("../data/processed/df_st.parquet")
English
plt.rcParams['figure.figsize'] = (20, 8)
plot_acf(df_st["English"], lags=100, zero=False);
plot_pacf(df_st["English"], lags=50, zero=False);
Observations
- Spikes at fixed intervals in the ACF plot suggest seasonality
- The spike at lag 1 in the PACF plot suggests an AR(1) term
params["English"] = {
"order": (1, 1, 5),
"seasonal_order": (1, 0, 1, 7)
}
Spanish
plot_acf(df_st["Spanish"], lags=100, zero=False);
plot_pacf(df_st["Spanish"], lags=50, zero=False);
Observations
- Spikes at fixed intervals in the ACF plot suggest seasonality
- Multiple spikes up to lag 6 in the PACF plot suggest an AR(6) term
params["Spanish"] = {
"order": (6, 0, 6),
"seasonal_order": (2, 0, 14, 7),
}
Russian
plot_acf(df_st["Russian"], lags=100, zero=False);
plot_pacf(df_st["Russian"], lags=50, zero=False);
Observations
- There are only small spikes in the ACF plot
- Spikes up to lag 3 in the PACF plot suggest an AR(3) term
params["Russian"] = {
"order": (3, 0, 2),
"seasonal_order": (1, 0, 1, 14),
}
Japanese
plot_acf(df_st["Japanese"], lags=100, zero=False);
plot_pacf(df_st["Japanese"], lags=50, zero=False);
Observations
- Spikes at fixed intervals in the ACF plot suggest seasonality
- Spikes up to lag 2 in the PACF plot suggest an AR(2) term
params["Japanese"] = {
"order": (2, 1, 4),
"seasonal_order": (1, 0, 15, 7),
}
French
plot_acf(df_st["French"], lags=100, zero=False);
plot_pacf(df_st["French"], lags=50, zero=False);
Observations
- Spikes at fixed intervals in the ACF plot suggest seasonality
- Spikes up to lag 3 in the PACF plot suggest an AR(3) term
params["French"] = {
"order": (3, 1, 2),
"seasonal_order": (2, 0, 9, 7),
}
German
plot_acf(df_st["German"], lags=100, zero=False);
plot_pacf(df_st["German"], lags=50, zero=False);
Observations
- Spikes at fixed intervals in the ACF plot suggest seasonality
- The spike at lag 1 in the PACF plot suggests an AR(1) term
params["German"] = {
"order": (1, 1, 2),
"seasonal_order": (3, 0, 14, 7),
}
Chinese
plot_acf(df_st["Chinese"], lags=100, zero=False);
plot_pacf(df_st["Chinese"], lags=50, zero=False);
Observations
- Spikes at fixed intervals in the ACF plot suggest seasonality
- The spike at lag 1 in the PACF plot suggests an AR(1) term
params["Chinese"] = {
"order": (1, 1, 5),
"seasonal_order": (4, 0, 14, 7),
}
df_st.to_parquet("../data/processed/df_st.parquet")
df.to_parquet("../data/processed/df_pandas.parquet")
params
# import json
# with open("../data/processed/params.json", "w") as f:
# json.dump(params, f)
{'English': {'order': (1, 1, 5), 'seasonal_order': (1, 0, 1, 7)},
'Spanish': {'order': (6, 1, 6), 'seasonal_order': (2, 0, 14, 7)},
'Russian': {'order': (3, 1, 2), 'seasonal_order': (1, 0, 1, 14)},
'Japanese': {'order': (2, 1, 4), 'seasonal_order': (1, 0, 15, 7)},
'French': {'order': (3, 1, 2), 'seasonal_order': (2, 0, 9, 7)},
'German': {'order': (1, 1, 2), 'seasonal_order': (4, 0, 14, 7)},
'Chinese': {'order': (1, 1, 5), 'seasonal_order': (4, 0, 14, 7)}}
Model Building¶
df=pd.read_parquet("../data/processed/df_pandas.parquet")
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse, mean_absolute_percentage_error as mape

def performance(actual, predicted):
    print('MAE :', round(mae(actual, predicted), 3))
    print('RMSE:', round(mse(actual, predicted)**0.5, 3))
    print('MAPE:', round(mape(actual, predicted), 3))
# time based split
train = df.iloc[:int(0.8*len(df))]
test = df.iloc[int(0.8*len(df)):]
ARIMA¶
This model uses the MA and AR terms to forecast the time series data. The model is defined by the following parameters:
- p: The number of lag observations included in the model, also called the lag order.
- d: The number of times that the raw observations are differenced, also called the degree of differencing.
- q: The size of the moving average window, also called the order of moving average.
However, a plain ARIMA model cannot capture the seasonality in the data, nor the effect of exogenous variables.
Pipeline to handle all languages
results = []
models={}
for column in columns:
print(f"Training {column}...")
model = SARIMAX(train[column], order=params[column]["order"])
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test))
# models[column] = model_fit
results.append({
"Language": column,
'MAE': round(mae(test[column], forecast), 3),
'RMSE': round(mse(test[column], forecast)**0.5, 3),
'MAPE': round(mape(test[column], forecast), 3),
'p': params[column]["order"][0],
'd': params[column]["order"][1],
'q': params[column]["order"][2],
})
# Plotting
plt.figure(figsize=(20, 8))
plt.plot(train[column], label="Train")
plt.plot(test[column], label="Test")
plt.plot(forecast, label="Forecast")
plt.title(f"{column} - ARIMA Model")
plt.legend()
plt.show()
# Convert results to DataFrame
results_df = pd.DataFrame(results)
Training Spanish...
Training Japanese...
Training English...
Training Chinese...
Training German...
Training Russian...
Training French...
c:\Users\v-gautamnaik\AppData\Local\Programs\Python\Python311\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:966: UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
warn('Non-stationary starting autoregressive parameters'
c:\Users\v-gautamnaik\AppData\Local\Programs\Python\Python311\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:978: UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
warn('Non-invertible starting MA parameters found.'
Observations
- The ARIMA model, built with parameters extracted from the ACF and PACF plots, captures the general trend in the data
- However, it fails to capture the seasonality for some languages
Results
results_df.sort_values("MAPE")
| Language | MAE | RMSE | MAPE | p | d | q | |
|---|---|---|---|---|---|---|---|
| 2 | English | 8306097.860 | 1.210796e+07 | 0.066 | 1 | 1 | 5 |
| 3 | Chinese | 438204.793 | 5.718531e+05 | 0.066 | 1 | 1 | 5 |
| 6 | French | 1140287.880 | 1.596890e+06 | 0.085 | 3 | 1 | 2 |
| 1 | Japanese | 1524784.613 | 1.889121e+06 | 0.097 | 2 | 1 | 1 |
| 4 | German | 1752924.870 | 2.408183e+06 | 0.098 | 1 | 1 | 2 |
| 5 | Russian | 1657838.177 | 2.290341e+06 | 0.107 | 3 | 1 | 2 |
| 0 | Spanish | 2348818.160 | 3.101108e+06 | 0.153 | 6 | 1 | 6 |
Observations
The above table shows the error metrics for each language using the ARIMA model
SARIMA¶
This model extends ARIMA by also capturing the seasonality in the data. It uses the same parameters as the ARIMA model, plus the following seasonal parameters:
- P: Seasonal lag order
- D: Seasonal difference order
- Q: Seasonal moving average order
- S: The number of time steps in a single seasonal period
This is a more advanced model than ARIMA. However, it still cannot capture the effect of exogenous variables.
Pipeline to handle all languages
results = []
models={}
cache = {}
for column in columns:
print(f"Training {column}...")
model = SARIMAX(train[column], order=params[column]["order"], seasonal_order=params[column]["seasonal_order"])
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test))
models[column] = model_fit
results.append({
"Language": column,
'MAE': round(mae(test[column], forecast), 3),
'RMSE': round(mse(test[column], forecast)**0.5, 3),
'MAPE': round(mape(test[column], forecast), 3),
'p': params[column]["order"][0],
'd': params[column]["order"][1],
'q': params[column]["order"][2],
'P': params[column]["seasonal_order"][0],
'D': params[column]["seasonal_order"][1],
'Q': params[column]["seasonal_order"][2],
'S': params[column]["seasonal_order"][3],
})
# Plotting
plt.figure(figsize=(20, 8))
plt.plot(train[column], label="Train")
plt.plot(test[column], label="Test")
plt.plot(forecast, label="Forecast")
plt.title(f"{column} - SARIMA Model")
plt.legend()
plt.show()
# Convert results to DataFrame
results_df = pd.DataFrame(results)
Training Spanish...
UserWarning: Non-invertible starting seasonal moving average. Using zeros as starting parameters.
Training Japanese...
UserWarning: Non-invertible starting seasonal moving average. Using zeros as starting parameters.
Training English...
Training Chinese...
Training German...
Training Russian...
Training French...
UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
Observations
- The SARIMA model captures the seasonality in the data. We used the parameters extracted from the ACF and PACF plots to build each model.
- With further tuning of these parameters, the seasonality could be captured even better.
Results
results_df.sort_values("MAPE")
| | Language | MAE | RMSE | MAPE | p | d | q | P | D | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Chinese | 4.226808e+05 | 5.364987e+05 | 0.065 | 1 | 1 | 5 | 4 | 0 | 14 | 7 |
| 6 | French | 1.247096e+06 | 1.711687e+06 | 0.092 | 3 | 1 | 2 | 2 | 0 | 9 | 7 |
| 0 | Spanish | 1.718605e+06 | 2.293428e+06 | 0.097 | 6 | 1 | 6 | 2 | 0 | 14 | 7 |
| 1 | Japanese | 1.599290e+06 | 1.961434e+06 | 0.102 | 2 | 1 | 4 | 1 | 0 | 15 | 7 |
| 2 | English | 1.214013e+07 | 1.400092e+07 | 0.103 | 1 | 1 | 5 | 1 | 0 | 1 | 7 |
| 5 | Russian | 2.022044e+06 | 2.622408e+06 | 0.132 | 3 | 1 | 2 | 1 | 0 | 1 | 14 |
| 4 | German | 2.456478e+06 | 3.004588e+06 | 0.141 | 1 | 1 | 2 | 4 | 0 | 14 | 7 |
SARIMAX¶
This model extends SARIMA by also capturing the effect of exogenous variables. It takes the same parameters as SARIMA, plus:
- exog: The exogenous variable(s) to include in the model
By modelling the exogenous effect, SARIMAX can produce more accurate forecasts than ARIMA or SARIMA when a meaningful external driver (such as an ad campaign) is available.
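The `Exog` column used below is assumed to be a binary campaign indicator. As a hypothetical sketch (made-up dates, not the actual campaign data), such a regressor can be built and aligned with the train/test split like this:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=100, freq="D")
df = pd.DataFrame({"English": 1000 + np.arange(100, dtype=float)}, index=idx)

# Hypothetical campaign window: 1 while the campaign runs, 0 otherwise
campaign = (df.index >= "2023-02-01") & (df.index <= "2023-02-14")
df["Exog"] = campaign.astype(int)

# The exogenous column must cover the forecast horizon too, because
# SARIMAX.forecast() requires future exog values.
split = int(0.8 * len(df))
train, test = df.iloc[:split], df.iloc[split:]
print(int(df["Exog"].sum()))  # 14 campaign days
```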
params["English"] = { "order": (6, 1, 6), "seasonal_order": (1, 0, 1, 7) }
model = SARIMAX(train["English"], order=params["English"]["order"], exog=train["Exog"], seasonal_order= params["English"]["seasonal_order"] )
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test), exog=test["Exog"])
performance(test["English"], forecast)
0.052988798444877507
# Plotting
plt.figure(figsize=(20, 8))
plt.plot(train["English"], label="Train")
plt.plot(test["English"], label="Test")
plt.plot(forecast, label="Forecast")
# plt.xlim(test.index.min()-timedelta(days=50), test.index.max())
plt.title("English - SARIMAX Model")
plt.legend()
plt.show()
Zoomed-in plot of the English language prediction using the exogenous variable
# Plotting
plt.figure(figsize=(20, 8))
plt.plot(train["English"], label="Train")
plt.plot(test["English"], label="Test")
plt.plot(forecast, label="Forecast")
plt.xlim(test.index.min()-timedelta(days=50), test.index.max())
plt.title("English - SARIMAX Model")
plt.legend()
plt.show()
Observations
- The SARIMAX model predicts the view counts well, capturing the trend and seasonality along with the effect of the exogenous variable
Hyperparameter Tuning using Optuna¶
def performance(true_values, predictions):
    return mape(true_values, predictions)
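The `mape` used by `performance` is assumed to be sklearn's `mean_absolute_percentage_error`, imported earlier in the notebook. The metric itself is straightforward to compute by hand:

```python
import numpy as np

def mape_manual(true_values, predictions):
    """Mean absolute percentage error: mean of |y - yhat| / |y|."""
    y = np.asarray(true_values, dtype=float)
    yhat = np.asarray(predictions, dtype=float)
    return np.mean(np.abs((y - yhat) / y))

# Errors of 10%, 10% and 0% average to roughly 6.67%
print(round(mape_manual([100, 200, 400], [110, 180, 400]), 4))  # 0.0667
```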
def objective(trial):
    p = trial.suggest_int('p', 0, 7)  # AR order
    # d = trial.suggest_int('d', 0, 2)  # differencing (fixed at 1 below)
    q = trial.suggest_int('q', 0, 7)  # MA order
    P = trial.suggest_int('P', 0, 2)  # seasonal AR order
    D = trial.suggest_int('D', 0, 1)  # seasonal differencing
    Q = trial.suggest_int('Q', 0, 2)  # seasonal MA order
    S = 7  # weekly seasonality

    order = (p, 1, q)
    seasonal_order = (P, D, Q, S)

    try:
        model = SARIMAX(train["English"], order=order, exog=train["Exog"], seasonal_order=seasonal_order)
        model_fit = model.fit(disp=False)
        forecast = model_fit.forecast(steps=len(test), exog=test["Exog"])
        error = performance(test["English"], forecast)
    except Exception:
        # Penalise parameter combinations that fail to fit
        error = np.inf
    return error
storage_name = "sqlite:///sarimax_optuna_study.db"
study = optuna.create_study(direction='minimize', study_name="sarimax_study", storage=storage_name, load_if_exists=True)
study.optimize(objective, n_trials=100)
print("Best hyperparameters: ", study.best_trial.params)
print("Best MAPE: ", study.best_value)
Best hyperparameters: {'p': 3, 'q': 5, 'P': 0, 'D': 1, 'Q': 1}
Best MAPE: 0.04394953148191757
params["English"] = { "order": (3, 1, 5), "seasonal_order": (0, 1, 1, 7) }
model = SARIMAX(train["English"], order=params["English"]["order"], exog=train["Exog"], seasonal_order= params["English"]["seasonal_order"] )
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test), exog=test["Exog"])
performance(test["English"], forecast)
0.04394953148191757
plt.figure(figsize=(20, 8))
plt.plot(train["English"], label="Train")
plt.plot(test["English"], label="Test")
plt.plot(forecast, label="Forecast")
# plt.xlim(test.index.min()-timedelta(days=50), test.index.max())
plt.title("English - SARIMAX Model")
plt.legend()
plt.show()
# Plotting
plt.figure(figsize=(20, 8))
plt.plot(train["English"], label="Train")
plt.plot(test["English"], label="Test")
plt.plot(forecast, label="Forecast")
plt.xlim(test.index.min()-timedelta(days=50), test.index.max())
plt.title("English - SARIMAX Model")
plt.legend()
plt.show()
Observations
- With Optuna hyperparameter tuning, the MAPE for the English language decreased from 0.053 to 0.044
Prophet Model¶
Pipeline to handle all languages
# Prophet for all languages
results = []

for column in columns:
    fb_df = pd.DataFrame({
        "ds": df.index,
        "y": df[column],
    }).reset_index(drop=True)

    train = fb_df.iloc[:int(0.8 * len(fb_df))]
    test = fb_df.iloc[int(0.8 * len(fb_df)):]

    m = Prophet(weekly_seasonality=True)
    m.fit(train)

    future = m.make_future_dataframe(periods=len(test), freq="D")
    forecast = m.predict(future)

    forecast = forecast.set_index("ds")
    train = train.set_index("ds")
    test = test.set_index("ds")

    plt.figure(figsize=(20, 8))
    plt.plot(train["y"], label="Train")
    plt.plot(test["y"], label="Test")
    plt.plot(forecast["yhat"], label="Forecast")
    plt.fill_between(forecast.index, forecast["yhat_lower"], forecast["yhat_upper"], color='violet', alpha=0.2, label="Confidence Interval")
    plt.title(f"{column} - Prophet Model")
    plt.legend()
    plt.show()

    results.append({
        "Language": column,
        'MAPE': performance(test["y"], forecast.loc[test.index, "yhat"]),
    })
02:28:40 - cmdstanpy - INFO - Chain [1] start processing
02:28:40 - cmdstanpy - INFO - Chain [1] done processing
Results
pd.DataFrame(results).sort_values("MAPE")
| | Language | MAPE |
|---|---|---|
| 3 | Chinese | 0.079152 |
| 1 | Japanese | 0.123545 |
| 0 | Spanish | 0.147672 |
| 6 | French | 0.149335 |
| 4 | German | 0.154803 |
| 5 | Russian | 0.447796 |
| 2 | English | 0.448351 |
Prophet model using exogenous variable
fb_df_english = pd.DataFrame({
    "ds": df.index,
    "y": df["English"],
    "regressor": df["Exog"]
}).reset_index(drop=True)

train = fb_df_english.iloc[:int(0.8 * len(fb_df_english))]
test = fb_df_english.iloc[int(0.8 * len(fb_df_english)):]

m = Prophet(weekly_seasonality=True)
m.add_regressor('regressor')
m.fit(train)

future = m.make_future_dataframe(periods=len(test))
future["regressor"] = fb_df_english["regressor"]
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper','regressor']].tail()
forecast = forecast.set_index("ds")
train = train.set_index("ds")
test = test.set_index("ds")
plt.figure(figsize=(20, 8))
plt.plot(train["y"], label="Train")
plt.plot(test["y"], label="Test")
plt.plot(forecast["yhat"], label="Forecast")
plt.fill_between(forecast.index, forecast["yhat_lower"], forecast["yhat_upper"], color='violet', alpha=0.2, label="Confidence Interval")
plt.title("English - Prophet Model")
plt.legend()
plt.show()
02:49:37 - cmdstanpy - INFO - Chain [1] start processing
02:49:37 - cmdstanpy - INFO - Chain [1] done processing
Results
performance(test["y"], forecast.loc[test.index, "yhat"])
0.06769322066744596
Observations
- With the exogenous variable, Prophet captures the campaign-driven spikes in addition to the trend and seasonality, reducing the English MAPE from 0.448 to 0.068
Insights and Recommendations¶
AdEase can use the SARIMAX model to forecast the number of views for the different languages. A separate model has to be built for each language. Using the forecasted values, AdEase can optimize ad placement for its clients.
Stationarity
We used the Augmented Dickey-Fuller (ADF) test to check the stationarity of the data. Only the Russian and Spanish time series were stationary; all other languages required first-order differencing to make the series stationary.
Page Views
- Among all the languages, English articles had the highest number of views, likely because English is the most widely used language. Ads can be placed on English articles to get maximum views.
- The Wikipedia homepage has the highest number of views of any page. AdEase can place ads on the homepage to get maximum views.
Monthly and Weekly Trends
- For Spanish, October, September and November have the highest number of views
- For Japanese, September, October, August and December have the highest number of views
- For English, August and July have the highest number of views
- For Chinese, December, November and October have the highest number of views
- For German, December and November have the highest number of views
- For Russian, August and July have the highest number of views
- For French, December and November have the highest number of views
These can be considered the peak months for the respective languages, and ad placement can be optimized during them.
- In most cases, Sundays have the highest number of views
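This kind of weekday pattern can be surfaced with a simple pandas group-by; a sketch on synthetic data (hypothetical Sunday bump, not the actual series):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2023-01-01", periods=70, freq="D")  # ten full weeks
views = pd.Series(1000 + rng.normal(0, 10, 70), index=idx)
views[idx.day_name() == "Sunday"] += 200  # hypothetical Sunday bump

# Average views per weekday; the peak day guides ad scheduling
by_day = views.groupby(views.index.day_name()).mean()
print(by_day.idxmax())  # Sunday
```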
Whenever there was a campaign, the number of views increased significantly. This can be seen in the English time series.